Recalibration of a Deep Learning Model for Low-Dose Computed Tomographic Images to Inform Lung Cancer Screening Intervals

Key Points
Question: Can a deep learning algorithm reuse a lung screening low-dose computed tomographic image to safely assign individuals with nodules to 1- vs 2-year screening?
Findings: Using data from 10 831 low-dose computed tomographic images with lung nodules from patients in the National Lung Screening Trial, this diagnostic study showed that a recalibrated deep learning algorithm was able to predict lung cancer diagnosis at a 1-year screen with good discrimination. The deep learning algorithm outperformed current American College of Radiology guidelines and statistical models.
Meaning: These findings suggest that deep learning algorithms could be used to assign individuals to 2-year screening intervals, reducing the harms of screening and potentially making screening cost-effective in some health care systems.


Linked-Year Method for Classification of Screen-Detected Lung Cancers
This information is reported in the supplement to Kovalchik et al (1): The classification of screen-detected lung cancers was based on the results of the diagnostic follow-up occurring within one year of a linked screen. A screen link started at the screen (T0, T1, or T2) and extended forwards from the end of the current diagnostic chain to the next event. An event could be another procedure, a lung cancer diagnosis, or another screen. If there was no next event, the next event was a screen, or the next event occurred more than 12 months after the current end of the chain, the chain ended. If the next event was a lung cancer within one year, then that cancer was considered screen-linked. Otherwise the next event was another procedure within one year, and that became the new end of the chain and the process repeated.
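The chain-building rule above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the NLST implementation: the event labels, the function name `is_screen_linked`, and the use of 365 days for "one year" are all assumptions, and the caller is assumed to pass only events that occur after the linked screen.

```python
from datetime import date, timedelta

# Hypothetical event labels for this sketch (not NLST field names).
PROCEDURE, CANCER, SCREEN = "procedure", "cancer", "screen"


def is_screen_linked(screen_date, events, window=timedelta(days=365)):
    """Return True if a lung cancer diagnosis is linked to the screen.

    events: list of (date, kind) tuples occurring after the screen.
    window: assumed length of "one year" (365 days in this sketch).
    """
    chain_end = screen_date
    for when, kind in sorted(events):
        # The chain ends at the next screen or after a gap of more than one year.
        if kind == SCREEN or when - chain_end > window:
            return False
        # A cancer within one year of the end of the chain is screen-linked.
        if kind == CANCER:
            return True
        # Otherwise a procedure within one year becomes the new end of the chain.
        chain_end = when
    # No next event: the chain ends without a linked cancer.
    return False
```

Because each procedure resets the end of the chain, a cancer diagnosed more than a year after the screen can still be screen-linked, provided every gap between consecutive events is at most one year and no further screen intervenes.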
A positive screen or report of a lung cancer diagnosis triggered the completion of a Diagnostic Evaluation form, where information on procedures and lung cancer diagnoses was captured. The study attempted to collect medical records for any follow-up a participant sought following a positive screen, for up to one year after the screen, or up to two years for nodules found on T2 screens that were either newly detected or showed growth from previous screens. Follow-up beyond one year (or two years for T2 screens) could be collected if the screening center determined that the follow-up was prompted by the screen. If the trial learned of a lung cancer diagnosis not resulting from an NLST screen (usually from either a participant self-report on the Annual Study Update or from a death certificate), the trial attempted to collect records back to whatever non-trial exam or initial presentation with symptoms led to the diagnosis.

Review of the Lung Cancer Risk Assessment Tool (LCRAT)
For all details and parameter estimates, see (2). Briefly, the LCRAT estimates a person's risk (usually 5-year risk, but in this paper, we calculate 1-year risk to match the 1-year screening intervals in the NLST) based on their demographics (age, race, gender, education), smoking (duration, intensity, and quit-years), and other lung-cancer risk factors (BMI, emphysema, and first-degree family history of lung-cancer). The LCRAT consists of two Cox sub-models, one for risk of lung-cancer and the other for the competing-risk of death. The LCRAT was fit in the PLCO control arm and has been validated in the chest-radiography arms of the PLCO and NLST, as well as the NIH-AARP cohort, and the ACS CPS-II cohort (2).

The LCRAT+CT model
For all details, see (3). Briefly, LCRAT+CT is a discrete-time Markov log-binomial risk model for 1-year risk of lung-cancer (4). LCRAT+CT first calculates 1-year pre-screening risk using the Lung Cancer Risk Assessment Tool (LCRAT) (2). We recently extended LCRAT+CT to predict next-screen risk after a non-malignant abnormal CT screen (3). LCRAT+CT calculates risk as the LCRAT 1-year pre-screening risk raised to an exponent, where the exponent is calculated as the sum of the regression coefficients corresponding to features of the abnormal CT. The LCRAT+CT was originally fit to 12,993 non-malignant abnormal CT-screens in the NLST, in which 235 lung-cancers were detected within 1 year (3), and we use those estimates. As a sensitivity analysis, we re-fit the model to the 10,831 abnormal screens in the NLST that also have an LCP-CNN score (195 lung-cancers detected within 1 year), to ensure comparability with LCP-CNN. As with LCP-CNN, for each nodule we used an 8-fold cross-validated LCRAT+CT score that was not fit to that nodule, using the same folds as for LCP-CNN. The re-fitted LCRAT+CT showed good cross-validated internal calibration (195 cases observed vs. 199.2 predicted, p=0.77) and discrimination (optimism-corrected AUC=0.77).
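As a minimal sketch of the exponent form described above, the following Python function raises a pre-screening risk to the sum of a set of CT-feature coefficients. The function name and the coefficient values in the usage note are hypothetical; the fitted LCRAT+CT estimates are reported in (3) and are not reproduced here.

```python
def lcrat_ct_risk(prescreen_risk, ct_coefficients):
    """Pre-screening risk raised to the sum of CT-feature coefficients.

    prescreen_risk: LCRAT 1-year pre-screening risk, strictly between 0 and 1.
    ct_coefficients: coefficients for the CT features present on the screen
        (hypothetical values in this sketch).
    """
    if not 0.0 < prescreen_risk < 1.0:
        raise ValueError("prescreen_risk must lie strictly between 0 and 1")
    exponent = sum(ct_coefficients)
    return prescreen_risk ** exponent
```

Because the pre-screening risk is below 1, an exponent below 1 increases the risk and an exponent above 1 decreases it. For example, with a hypothetical pre-screening risk of 0.01 and coefficients summing to 0.8, the risk becomes 0.01^0.8, or about 0.025.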

The Lung Cancer Prediction Convolutional Neural Network (LCP-CNN)
We used an AI developed to discriminate malignant from benign nodules to evaluate whether it could improve a clinical model for predicting a screening patient's likelihood of developing cancer over the upcoming year. Optellum's LCP-CNN is an AI designed to distinguish benign lung nodules from malignancies using only low-dose CT data, without any assistance from patient clinical factors such as age or smoking history. Optellum trained the AI using specially curated data from the NLST and validated it on external (non-screening) sources from several other sites (5,6). The LCP-CNN assigns a score from 0 to 100 to each nodule it analyses, where the analysed nodule must be indicated in 3D on the CT by a clinician. It therefore cannot score a CT on which no nodule appears, such as CTs with purely negative findings in the NLST.
We obtained a subset of the cross-validated NLST LCP-CNN scores described in Massion et al (5) directly from Optellum for this work. While the original data described in Massion et al contain one score per nodule per screening study, the current work requires only one score per screening round, so only the highest-scoring nodule's score was used for any given patient-screen. This dataset includes only patients with nodules listed in the NLST metadata as being 5mm or over, and explicitly excludes nodules classified as ground glass opacities (GGOs), because those criteria were part of the initial data selection protocol described in Massion et al. The LCP-CNN was also unable to score nodules at the very periphery of a CT (without full support for a bounding box of the size fed into the AI), so a small number of CTs were missing scores for that reason. As well as reviewing all CTs on which one or more nodules were recorded, a medical doctor or medical student, under expert supervision from University of Oxford Radiologists, reviewed all CTs of patients recorded as having developed lung cancer, and fully reviewed and extended their mark-up and metadata. Additional nodules not listed in the NLST metadata were also added as long as they were not fully calcified (since fully calcified nodules were not considered "positive findings" in the original NLST data). Finally, these data represent the eight cross-validation splits described in Massion et al, so they are technically a set of scores from eight different AIs, each trained on a different combination of patients for the same task.
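The reduction from one score per nodule to one score per patient-screen can be sketched as follows. The tuple layout and the function name are assumptions for illustration; the actual Optellum data format is not described here.

```python
def max_score_per_screen(nodule_scores):
    """Keep the highest nodule score for each (patient, screening round).

    nodule_scores: iterable of (patient_id, screen_round, score) tuples,
        a hypothetical layout for this sketch.
    Returns a dict mapping (patient_id, screen_round) to the maximum score.
    """
    best = {}
    for patient_id, screen_round, score in nodule_scores:
        key = (patient_id, screen_round)
        if key not in best or score > best[key]:
            best[key] = score
    return best
```

A patient-screen with no scored nodule simply does not appear in the result, which mirrors the fact that such screens have no LCP-CNN score.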

Recalibrating the LCP-CNN to predict 1-year lung cancer risk
The LCP-CNN predicts risk of immediate malignancy given nodule image features. We recalibrated the LCP-CNN score to predict 1-year lung cancer risk following a non-malignant abnormal screen by fitting a logistic regression model in which the logit of the LCP-CNN score is the sole covariate. When multiple nodules on a CT image have an LCP-CNN score, we considered only the nodule with the highest LCP-CNN score. The model has the form logit(P[lung cancer within 1 year]) = β0 + β1 × logit(LCP-CNN score), with estimated slope β1 = 0.66. The negative intercept β0 reflects the much lower absolute risks attainable for 1-year prediction versus immediate prediction of malignancy. The OR is exp(0.66)=1.93 per unit logit increase in the LCP-CNN score (p<<0.0001). Thus, although the LCP-CNN score is meant to predict immediate malignancy, it is also strongly predictive of 1-year lung-cancer risk for those with a non-malignant abnormal screen.
We used 8-fold cross-validation to check the calibration of the model, using the same folds that were used to develop the LCP-CNN; that is, we fit the model to 7/8 of the data to make predictions for the remaining 1/8. In this way, no observation contributed to the model fit used to make a prediction for it. The model predicted 195.62 lung-cancers versus the 195 observed (p>0.9), indicating good calibration for predicting 1-year lung-cancer risk. The recalibrated LCP-CNN score had optimism-corrected AUC=0.87.
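The recalibration and cross-validated calibration check can be illustrated with a small standard-library sketch. It fits a one-covariate logistic regression by Newton-Raphson and produces held-out predictions fold by fold. The function names are hypothetical, the covariate `x` is assumed to be supplied already on the logit scale (for example, the logit of the score rescaled to lie between 0 and 1, which is an assumption about how the 0-100 score is transformed), and the fitting routine stands in for whatever statistical software was actually used.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    # Start at the intercept-only solution: the logit of the event rate.
    b0, b1 = logit(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = expit(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * xi     # score for the slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def cross_validated_risk(x, y, folds):
    """Predict each observation from a model fit without its own fold."""
    predictions = [None] * len(y)
    for k in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != k]
        b0, b1 = fit_logistic([x[i] for i in train], [y[i] for i in train])
        for i, f in enumerate(folds):
            if f == k:
                predictions[i] = expit(b0 + b1 * x[i])
    return predictions
```

Calibration can then be summarised by comparing `sum(predictions)` with `sum(y)`, which corresponds to the predicted versus observed counts of lung cancers quoted above. The sketch does not compute the accompanying p-value.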

Combining LCRAT and the LCP-CNN score: The LCRAT+LCPCNN model
We combined the LCRAT and LCP-CNN scores by including them as covariates (on the logit scale) in a logistic regression model for 1-year lung-cancer following a non-malignant abnormal screen. When multiple nodules on a CT image have an LCP-CNN score, we considered only the nodule with the highest LCP-CNN score. The model has the form logit(P[lung cancer within 1 year]) = β0 + β1 × logit(LCRAT 1-year risk) + β2 × logit(LCP-CNN score), with estimated β1 = 0.39. The LCRAT score has OR of exp(0.39)=1.5 per unit logit increase in the LCRAT risk. The increased OR for the LCRAT score reflects the fact that having a non-malignant abnormal screen increases subsequent risk of cancer. Including the LCRAT covariate improves the model fit versus only using the LCP-CNN score (p=0.001).
We used 8-fold cross-validation to check the calibration of the model, using the same folds that were used to develop the LCP-CNN; that is, we fit the model to 7/8 of the data to make predictions for the remaining 1/8. In this way, no observation contributed to the model fit used to make a prediction for it. The model predicted 195.62 lung-cancers versus the 195 observed (p>0.9), indicating good calibration for predicting 1-year lung-cancer risk.
The LCRAT+LCPCNN score had optimism-corrected AUC=0.87, very similar to the recalibrated LCP-CNN. Thus, although including LCRAT improves the model, the LCP-CNN score alone suffices to predict 1-year lung-cancer risk with high discrimination. This finding suggests that some nodules are premalignant and become malignant within 1 year. Thus, features of current (premalignant) nodules remain relevant to predicting malignancy risk at 1 year.
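For readers who want to reproduce a discrimination comparison of this kind, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen case has a higher score than a randomly chosen non-case, with ties counted as one half. The sketch below computes a plain AUC only. The optimism correction is not implemented, and applying the function to cross-validated predictions is offered as one possible approach rather than the method used in this work.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.

    scores: predicted risks or scores.
    labels: 1 for a lung cancer within 1 year, 0 otherwise.
    """
    cases = [s for s, l in zip(scores, labels) if l == 1]
    controls = [s for s, l in zip(scores, labels) if l == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for c in cases:
        for n in controls:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(cases) * len(controls))
```

The pairwise loop is quadratic in the worst case, but with roughly 200 cases and 10,000 non-cases it involves only about two million comparisons.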

Sample size
In total, there were 13,654 abnormal LDCTs in the first and second screening rounds of the NLST that did not result in a lung cancer diagnosis (7). Of these, 12,993 were used to develop the LCRAT+CT (3), as anyone who did not have a screen at the following screening round was excluded (N=661). Of these, 10,831 were identified by Optellum as having a nodule with maximum diameter ≥5mm, and therefore have an LCP-CNN score and are included in our analysis dataset. We note that in the NLST metadata, the largest nodule on 12% of these screens had a mean diameter of 4mm.

Results by nodule size
When 5% of cancers are delayed in diagnosis, LCRAT+CT can assign more small nodules (4-5mm as reported in the NLST metadata) to biennial screening (56.1% vs 47.7% for LCP-CNN, p<0.001), whereas LCP-CNN assigned more larger nodules (≥8mm) to biennial screening (15.7% vs 3.6%, p<0.001). When 10% of cancers are delayed in diagnosis, LCP-CNN's increased ability to assign biennial screening is especially notable for large nodules (≥11mm diameter) compared to LCRAT+CT (27.1% vs 1.8%, p<0.001). Results in which thresholds of 20% and 35% of cancers are delayed in diagnosis are also shown in Table S1.
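The idea of fixing the percentage of cancers with delayed diagnosis and then asking how many screens can be assigned to biennial screening can be sketched as below. This is one plausible reading of the procedure, stated as an assumption: the risk threshold is set at the highest value for which the share of cancers falling below it does not exceed the allowed fraction, and every screen with risk below that threshold is assigned to biennial screening. The function name and the toy numbers in the usage note are hypothetical.

```python
def biennial_assignment(risks, has_cancer, delay_fraction):
    """Return (threshold, fraction of screens assigned to biennial screening).

    risks: predicted 1-year risks, one per screen.
    has_cancer: 1 if lung cancer was detected within 1 year, else 0.
    delay_fraction: allowed share of cancers with delayed diagnosis (e.g. 0.05).
    """
    cancer_risks = sorted(r for r, c in zip(risks, has_cancer) if c == 1)
    # Number of cancers allowed to fall below the threshold; the small
    # epsilon guards against floating-point products such as 28.999999.
    allowed = int(delay_fraction * len(cancer_risks) + 1e-9)
    if allowed >= len(cancer_risks):
        threshold = float("inf")
    else:
        # Screens strictly below this risk are assigned to biennial screening,
        # so at most `allowed` cancers are delayed.
        threshold = cancer_risks[allowed]
    biennial = sum(1 for r in risks if r < threshold)
    return threshold, biennial / len(risks)
```

For instance, with eight screens at risks 0.01 through 0.08, cancers at 0.02, 0.05, 0.07 and 0.08, and an allowed delay of 25%, the threshold is 0.05 and half of the screens are assigned to biennial screening. Restricting `risks` and `has_cancer` to a nodule-size stratum would give size-specific percentages in the spirit of those reported above.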

Sensitivity analysis: Restricted to one screen per person
When the sample was restricted to one screen per person, a total of 7,495 screens were included in the analysis (3,704 from T0 and 3,791 from T1), with 158 screen-detected lung cancers. The AUC for the recalibrated LCP-CNN score (0.88) was greater than that of LCRAT+CT (0.79, p<0.001) and Lung-RADS (0.71, p<0.001), and did not differ from that of the combined LCRAT+LCPCNN (0.88, p=0.1449).
Lung-RADS would assign 64% of abnormal presumed non-malignant screens to biennial screening at a threshold of ≤2. LCP-CNN